1,065 research outputs found

    Enhanced suffix arrays as language models: Virtual k-testable languages

    In this article, we propose the use of suffix arrays to efficiently implement n-gram language models with practically unlimited size n. This approach, which is used with synchronous back-off, allows us to distinguish between alternative sequences using large contexts. We also show that we can build this kind of model with additional information for each symbol, such as part-of-speech tags and dependency information. The approach can also be viewed as a collection of virtual k-testable automata. Once built, we can directly access the results of any k-testable automaton generated from the input training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.
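The idea of reading n-gram counts of any length off a suffix array can be illustrated with a minimal sketch. It assumes a plain (not enhanced) suffix array built by naive sorting, and the function names `build_suffix_array` and `count_ngram` are hypothetical; it is not the implementation described in the article.

```python
from typing import List, Sequence


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction (sorting slices); fine for a demo, too slow for
    large corpora, where dedicated construction algorithms are used.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_ngram(tokens: List[str], sa: List[int], ngram: Sequence[str]) -> int:
    """Count occurrences of `ngram` (any length n) with two binary searches."""
    target = list(ngram)
    n = len(target)

    # Lower bound: first suffix whose first n tokens are >= target.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] < target:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first n tokens are > target.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + n] <= target:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes starting with the n-gram are contiguous in the array.
    return lo - first
```

Because every n-gram is a prefix of some suffix, the same array answers count queries for any n without storing separate tables per order.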

    Finding patterns in strings using suffix arrays

    Finding regularities in large data sets requires implementations of systems that are efficient in both time and space requirements. Here, we describe a newly developed system that exploits the internal structure of the enhanced suffix array to find significant patterns in a large collection of sequences. The system searches exhaustively for all significantly compressing patterns, where patterns may consist of symbols and skips or wildcards. We demonstrate a possible application of the system by detecting interesting patterns in a Dutch and an English corpus.

    Cue phrase selection methods for textual classification problems

    The classification of texts and pieces of text uses the occurrence of words, and of combinations of words, as an important indicator. Not every word or combination of words gives a clear indication of the classification of a piece of text. Research has been done on methods that select the words or combinations of words that are more indicative of the type of a piece of text. These are selected from the words and word groups as they occur in the texts, and we call them "cue-phrases". The goal of these methods is to select the most indicative cue-phrases first. The collection of selected words and/or combinations thereof can then be used for training the classification system. To test these selection methods, a number of experiments were carried out on a corpus containing cookbook recipes and on a corpus of four-participant meetings. To perform these experiments, a computer program was written. On the recipe corpus we looked at classifying the sentences into different types, such as "requirement" and "instruction". On the four-person meeting corpus we tried to learn, using only lexical features, whether a sentence is addressed to an individual or a group. The experiments on the recipe corpus produced good results, showing that a number of the cue-phrase selection methods used are suitable for feature selection. The experiments on the four-person meeting corpus were less successful in terms of performance on the classification task. We did see comparable patterns in selection methods, and considering the results of Jovanovic we can conclude that different features are needed for this particular classification task. One of the original goals was to look at "addressee" in discussions: are sentences more often addressed to individuals inside discussions than outside discussions? However, to be able to answer this, we must first identify the segments of the text that are discussions. It proved hard to come to a reliable specification of discussions, and our initial definition wasn't sufficient.

    Statistical language models for alternative sequence selection

    Unlocking language archives using search

    The Language Archive manages one of the largest and most varied sets of natural language data. This data consists of video and audio enriched with annotations. It is available for more than 250 languages, many of which are endangered. Researchers have a need to access this data conveniently and efficiently. We provide several browse and search methods to cover this need, which have been developed and expanded over the years. Metadata and content-oriented search methods can be connected for a more focused search. This article aims to provide a complete overview of the available search mechanisms, with a focus on annotation content search, including a benchmark

    Token merging in language model-based confusible disambiguation

    In the context of confusible disambiguation (spelling correction that requires context), the synchronous back-off strategy combined with traditional n-gram language models performs well. However, when alternatives consist of a different number of tokens, this classification technique cannot be applied directly, because the computation of the probabilities is skewed. Previous work already showed that probabilities based on n-grams of different orders should not be compared directly. In this article, we propose new probability metrics in which the size of n is varied according to the number of tokens of the confusible alternative. This requires access to n-grams of variable length. Results show that the synchronous back-off method is extremely robust. We discuss the use of suffix trees as a technique to store variable length n-gram information efficiently.
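The baseline strategy for single-token alternatives can be sketched as follows. This is a simplified reading of synchronous back-off, assuming left context only and raw counts from a `Counter`; the helper names `ngram_counts` and `synchronous_backoff` are hypothetical, and the sketch does not implement the new metrics for multi-token alternatives proposed in the article.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def ngram_counts(tokens: Sequence[str], max_n: int) -> Counter:
    """Count every n-gram of order 1..max_n in the training tokens."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def synchronous_backoff(
    counts: Counter,
    left_context: List[str],
    alternatives: Sequence[str],
    max_n: int,
) -> Tuple[Optional[str], int]:
    """Pick an alternative by comparing all of them at the same order n.

    Start at the largest order; if no alternative has been seen at that
    order, back off to n - 1 for all alternatives at once. Orders are
    never mixed, so scores of different orders are never compared.
    """
    for n in range(max_n, 0, -1):
        if len(left_context) < n - 1:
            continue  # not enough context for this order
        ctx = tuple(left_context[-(n - 1):]) if n > 1 else ()
        # The context is shared, so comparing counts of (ctx + alt) is
        # equivalent to comparing the conditional probabilities P(alt | ctx).
        scores = {alt: counts[ctx + (alt,)] for alt in alternatives}
        if any(scores.values()):
            return max(scores, key=scores.get), n
    return None, 0
```

The function returns both the chosen alternative and the order at which the decision was made, which makes it easy to inspect how much context was actually available.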

    Federated search: Towards a common search infrastructure

    Within scientific institutes there exist many language resources. These resources are often quite specialized and relatively unknown. The current infrastructural initiatives try to tackle this issue by collecting metadata about the resources and establishing centers with stable repositories to ensure the availability of the resources. It would be beneficial if the researcher could, by means of a simple query, determine which resources and which centers contain information useful to his or her research, or even work on a set of distributed resources as a virtual corpus. In this article we propose an architecture for a distributed search environment allowing researchers to perform searches in a set of distributed language resources

    Microvascular Dysfunction Is Associated With a Higher Incidence of Type 2 Diabetes Mellitus: A Systematic Review and Meta-Analysis

    Objective: Recent data support the hypothesis that microvascular dysfunction may be a potential mechanism in the development of insulin resistance. We examined the association of microvascular dysfunction with incident type 2 diabetes mellitus (T2DM) and impaired glucose metabolism by reviewing the literature and conducting a meta-analysis of longitudinal studies on this topic. Methods and Results: We searched Medline and Embase for articles published up to October 2011. Prospective cohort studies that focused on microvascular measurements in participants free of T2DM at baseline were included. Pooled relative risks were calculated using random effects models. Thirteen studies met the inclusion criteria for this meta-analysis. These studies focused on T2DM or impaired fasting glucose, not on impaired glucose tolerance. The pooled relative risk for incident T2DM (3846 cases) was 1.25 (95% confidence interval, 1.15; 1.36) per 1 SD greater microvascular dysfunction when all estimates of microvascular dysfunction were combined. In analyses of single estimates of microvascular dysfunction, the pooled relative risk for incident T2DM was 1.49 (1.36; 1.64) per 1 SD higher plasma soluble E-selectin levels; 1.21 (1.11; 1.31) per 1 SD higher plasma soluble intercellular adhesion molecule-1 levels; 1.48 (1.03; 2.12) per 1 SD lower response to acetylcholine-mediated peripheral vascular reactivity; 1.18 (1.08; 1.29) per 1 SD lower retinal arteriole-to-venule ratio; and 1.43 (1.33; 1.54) per 1 logarithmically transformed unit higher albumin-to-creatinine ratio. In addition, the pooled relative risk for incident impaired fasting glucose (409 cases) was 1.15 (1.01; 1.31) per 1 SD greater retinal venular diameters. Conclusion: These data indicate that various estimates of microvascular dysfunction were associated with incident T2DM and, possibly, impaired fasting glucose, suggesting a role for the microcirculation in the pathogenesis of T2DM. (Arterioscler Thromb Vasc Biol. 2012;32:3082-3094.)
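For readers unfamiliar with random-effects pooling, the following is a minimal sketch of one common estimator, DerSimonian-Laird. The abstract does not say which random-effects estimator the authors used, so this choice is an assumption, and the function name `pool_random_effects` is hypothetical. Inputs are per-study relative risks with their 95% confidence limits.

```python
import math
from typing import Sequence, Tuple

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def pool_random_effects(
    rrs: Sequence[float],
    ci_lows: Sequence[float],
    ci_highs: Sequence[float],
) -> Tuple[float, float, float]:
    """Pool relative risks with the DerSimonian-Laird estimator.

    Returns (pooled RR, lower 95% limit, upper 95% limit).
    """
    # Work on the log scale, where estimates are roughly normal.
    y = [math.log(r) for r in rrs]
    # Recover each standard error from the width of its 95% interval.
    se = [(math.log(hi) - math.log(lo)) / (2 * Z95)
          for lo, hi in zip(ci_lows, ci_highs)]

    # Fixed-effect (inverse-variance) weights and Cochran's Q.
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of the between-study variance tau^2.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (s ** 2 + tau2) for s in se]
    sw_re = sum(w_re)
    y_re = sum(wi * yi for wi, yi in zip(w_re, y)) / sw_re
    se_re = math.sqrt(1.0 / sw_re)

    return (math.exp(y_re),
            math.exp(y_re - Z95 * se_re),
            math.exp(y_re + Z95 * se_re))
```

When the studies agree closely, the estimated between-study variance is zero and the result coincides with fixed-effect inverse-variance pooling.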

    Neighbourhood property value and type 2 diabetes mellitus in the Maastricht study: A multilevel study

    Objective: Low individual socioeconomic status (SES) is known to be associated with a higher risk of type 2 diabetes mellitus (T2DM), but the extent to which the local context in which people live may influence T2DM rates remains unclear. This study examines whether living in a low property value neighbourhood is associated with higher rates of T2DM independently of individual SES. Research design and methods: Using cross-sectional data from the Maastricht Study (2010–2013) and geographical data from Statistics Netherlands, multilevel logistic regression was used to assess the association between neighbourhood property value and T2DM. Individual SES was based on education, occupation and income. Of the 2,056 participants (aged 40–75 years), 494 (24%) were diagnosed with T2DM. Results: Individual SES was strongly associated with T2DM, but a significant proportion of the variance in T2DM was found at the neighbourhood level (VPC = 9.2%; 95% CI = 5.0%–16%). Participants living in the poorest neighbourhoods had 2.38 times higher odds of T2DM compared to those living in the richest areas (95% CI = 1.58–3.58), independently of individual SES. Conclusions: Neighbourhood property value showed a significant association with T2DM, suggesting the usefulness of area-based programmes aimed at improving neighbourhood characteristics in order to tackle inequalities in T2DM.
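As background on the variance partition coefficient (VPC) reported above: for a two-level logistic model, one widely used definition is the latent-variable formulation shown below. The abstract does not state which definition the authors applied, so this is offered only as an assumed illustration, with \sigma_u^2 denoting the between-neighbourhood variance of the random intercept.

```latex
\[
  \mathrm{VPC} \;=\; \frac{\sigma_u^2}{\sigma_u^2 + \pi^2/3}
\]
```

Here \pi^2/3 (about 3.29) is the variance of the standard logistic distribution, which plays the role of the individual-level residual variance on the latent scale.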

    T polymorphism in determining hepatic lipase activity: the Hoorn Study

    T polymorphism, after adjustment for age, sex, carbohydrate and protein intakes, and insulin or body mass inde